Skip to content

feat(raster): view machinery for non-identity band views#813

Draft
james-willis wants to merge 18 commits into
apache:mainfrom
james-willis:jw/nd-raster-views
Draft

feat(raster): view machinery for non-identity band views#813
james-willis wants to merge 18 commits into
apache:mainfrom
james-willis:jw/nd-raster-views

Conversation

@james-willis
Copy link
Copy Markdown
Contributor

Summary

Reintroduces view machinery on top of #749 so callers can construct and read bands with non-identity view entries. #749 itself only writes the canonical identity (null view row) and rejects non-null views at read time; this PR re-enables the composition path that the slice/manipulation functions in #750 depend on.

Stack

What's in this PR

Single commit, three files:

  • rust/sedona-raster/src/traits.rs
    • validate_view() + 22 unit tests (length checks, axis range, negative step, broadcast / step=0, i64 overflow guards, etc.)
  • rust/sedona-raster/src/builder.rs
    • start_band_with_view() builder API
    • ~12 view-construction tests: slice, broadcast, transpose, negative-step, multi-band mixed views, Arrow IPC round-trip
  • rust/sedona-raster/src/array.rs
    • View → byte-stride composition in nd_buffer()
    • View-aware contiguous_data() Cow::Owned strided-copy slow path
    • Reader tests for explicit views (negative steps, OOB axis, length mismatch, malformed view rejection)

Behavior change vs #749

  • A band with a non-null view row now decodes via the composition path instead of being rejected as a read error. Identity-view (null view row) writer/reader behavior is unchanged.

Test plan

  • cargo build -p sedona-schema -p sedona-raster -p sedona-raster-gdal -p sedona-raster-functions -p sedona-testing
  • cargo test -p sedona-raster — 91 tests pass (includes the 22 validate_view tests, 12 builder view tests, 4 array view tests)
  • cargo clippy -p sedona-raster --all-targets -- -D warnings
  • cargo fmt --all --check

@github-actions github-actions Bot requested a review from paleolimbot May 5, 2026 00:21
@james-willis james-willis marked this pull request as draft May 5, 2026 16:58
@james-willis james-willis force-pushed the jw/nd-raster-views branch 2 times, most recently from 349c957 to 91966ed Compare May 5, 2026 18:14
Replaces apache#787's 2D-only band schema with the canonical N-D schema:
spatial_dims/spatial_shape at the raster level; bands carry dim_names,
source_shape, nullable view, outdb_uri, outdb_format, plus the
non-nullable data buffer. Removes nodata_value, storage_type,
outdb_url, and outdb_band_id - every one is encodable in the new
schema:

- storage_type ↔ outdb_uri.is_null() (null = InDb, set = OutDbRef).
- outdb_url ↔ outdb_uri (no rename, same string).
- outdb_band_id ↔ encoded inside outdb_uri (#band=N or GDAL native
  subdataset URI), parsed only inside the GDAL format driver.
- nodata_value ↔ typed nodata: Binary (a null row means "no nodata").

Top-level adds spatial_dims: List<Utf8View> and spatial_shape:
List<Int64>; nullable view is List<Struct<source_axis, start, step,
steps: Int64>> where a null row encodes the canonical identity view.

Note: intermediate commits in this PR are not expected to build; only
the PR tip is CI-green. The trait, reader/builder, RS_* migration,
and GDAL loader port land in subsequent commits.
RasterRef and BandRef accessors over the canonical N-D schema:
spatial_dims/spatial_shape, transform, crs, num_bands, band(i), and
band-level dim_names, source_shape, shape (visible, derived from view),
view, data_type, nodata, outdb_uri, outdb_format, nd_buffer,
contiguous_data returning Cow<[u8]>.

validate_view enforces all view rules including i64-overflow on
start + (steps-1)*step. NdBuffer exposes raw buffer + shape + byte
strides + offset for zero-copy access (numpy / Arrow C Data Interface
boundary); VIEW → byte strides happens inside nd_buffer().

Adds BandRef::is_2d() default method as the gate GDAL-backed paths
use to refuse N-D input cleanly: true iff dim_names == ["y","x"]
over the identity view.
… reader/builder + RS_* migration

View-aware Arrow reader (RasterStructArray, BandRefImpl) with corruption-
surgery (negative steps, bad source_axis, length mismatch) that
round-trips an ArrowError. Builder exposes start_raster / start_band
for full N-D plus start_raster_2d / start_band_2d for legacy 2D, with
identity-view default written as a null view row. finish_raster
validates each band's visible shape against the raster's spatial_shape
along the spatial dims.

All 33 RS_* functions migrated mechanically; outputs on 2D inputs are
byte-identical to apache#787. RS_BandPath keeps its existing inline
fragment-stripping (format-agnostic display, untouched by the GDAL
parser). Test helpers in sedona-testing rewritten on the N-D builder
API.
Reads outdb_uri + parse_outdb_source instead of apache#787's storage_type /
outdb_url / outdb_band_id triplet. Each GDAL-backed SQL function gates
on BandRef::is_2d() at entry and returns an Execution error on N-D
input. VSI normalization, the dataset cache, and RasterIO bodies are
byte-for-byte unchanged from apache#787 - only the schema-read sites move.

In-db reads use BandRef::contiguous_data() and require Cow::Borrowed
so MEM datasets can point at the StructArray's backing buffer without
copying; for is_2d identity views this always holds.

Tests rebuilt to use RasterBuilder directly. Adds an N-D rejection
test for raster_ref_to_gdal_mem and the VRT path, plus an end-to-end
`raster_ref_to_gdal_mem` previously returned a `Result<Dataset>` and
guarded against `BandRef::contiguous_data()` returning `Cow::Owned`
with a runtime tripwire ("Internal: contiguous_data must be borrowed
for is_2d bands; got owned"). The check was correct — handing GDAL a
pointer into a `Vec<u8>` that drops at the end of the iteration would
dangle — but it ties an internal invariant ("`is_2d` ⇒ Borrowed") to
incidental properties of today's reader. Any future copy path in the
reader (compression, BinaryView block-boundary stitching, alignment
fix-up, sliced/broadcast/transposed views from apache#813 / apache#750) would
detonate the tripwire on perfectly valid 2-D rasters.

Change: return `Result<(Dataset, Vec<Vec<u8>>)>`. On `Cow::Borrowed`
the GDAL band still points directly at the StructArray buffer
(zero-copy). On `Cow::Owned` we move the `Vec<u8>` out of the Cow
without copying — the reader's existing materialization is the only
allocation — and stash it in the returned vector. The caller (the
provider in `gdal_dataset_provider.rs`) parks it in a new
`RasterDataset::_owned_band_bytes` field that lives as long as the
MEM dataset that holds the pointers.

`raster_ref_to_gdal_empty` discards the always-empty vector.
@james-willis james-willis force-pushed the jw/nd-raster-views branch from 91966ed to a7fd9cc Compare May 7, 2026 18:52
`BandRef::nd_buffer()` and `BandRef::contiguous_data()` previously
returned the empty Arrow `data` buffer when `is_indb() == false`,
giving consumers silent garbage instead of a signal that the band
needs a backend-specific OutDb resolver. Replace the silent path
with `ArrowError::NotYetImplemented`, documenting the integration
point for the future resolver work.

No existing call site in `sedona-raster-functions` reads bytes from
OutDb bands today (RS_BandPath only reads the URI), so this guard
turns a latent gap into a visible one without breaking behavior.
@james-willis james-willis force-pushed the jw/nd-raster-views branch 2 times, most recently from ba03f11 to f39f297 Compare May 12, 2026 16:39
…lity shim

Reintroduces the pre-N-D metadata surface as a parallel API layer so call
sites stay aligned with main and downstream branches don't have to chase
the N-D refactor while their work is in flight. Per Dewey's PR review
feedback, this is intentionally a compatibility scaffold — DB-15 tracks
the removal once the stack lands.

What the shim adds (`rust/sedona-raster/src/traits.rs`):
- `RasterMetadata` struct + `MetadataRef` trait (with a blanket
  `&T: MetadataRef` impl so `from_metadata(&m)` and
  `from_metadata(raster.metadata())` both compile)
- `BandMetadata` struct exposing `data_type()`, `nodata_value()`,
  `storage_type()`, `outdb_url()`, `outdb_band_id()`,
  `nodata_value_as_f64()` — `outdb_url`/`outdb_band_id` are eagerly
  parsed from the N-D `outdb_uri` via `split_outdb_band_fragment`
- `RasterRef::metadata() -> RasterMetadata` and
  `BandRef::metadata() -> BandMetadata` default impls
- `BandRef::data() -> &[u8]` for InDb identity views (panics elsewhere
  in PR-B; the `Cow::Owned` materialized-view case comes back with the
  view machinery in PR-D)
- `Bands<'a>` view + `RasterRefBandsExt` extension trait so callers can
  keep using `raster.bands().band(i)` / `.iter()` / `.len()`

Builder shim (`rust/sedona-raster/src/builder.rs`):
- `start_raster(&dyn MetadataRef, Option<&str>)` and
  `start_band(BandMetadata)` alongside the N-D-native
  `start_raster_nd` / `start_band_nd` (and the 2D positional convenience
  variants)
- Restored doc example using the `RasterMetadata{..}` / `BandMetadata{..}`
  literal pattern

`AffineMatrix` and GDAL adapters:
- `AffineMatrix::from_metadata<M: MetadataRef>(m: M)` (replaces
  `from_transform`) so `to_world_coordinate(raster)` / `rotation(raster)`
  /`to_raster_coordinate(raster)` match main
- Re-add `ToGdalGeoTransform` and `RasterMetadataFromGdalGeoTransform`
  helper traits in `gdal_common.rs`; `raster_ref_to_gdal_mem` reads
  `bands.band(i)` with `band.metadata().{data_type, nodata_value,
  storage_type}` and returns `Result<Dataset>` (no more owned-bytes
  plumbing — `band.data()` yields a borrowed slice straight out of the
  StructArray, kept alive via `RasterDataset._source_raster:
  PhantomData<&dyn RasterRef>`)
- `build_vrt_from_sources` / `raster_ref_to_gdal` / `VrtKey::from_raster`
  in `gdal_dataset_provider.rs` use 1-based `bands.band(i)` and the
  `BandMetadata` accessors; `VrtBandKey` gets its `storage_type` field
  back

Test fixtures and call sites: `rs_band_accessors`, `rs_bandpath`,
`rs_convexhull`, `rs_envelope`, `rs_example`, `rs_georeference`,
`rs_geotransform`, `rs_numbands`, `rs_pixel_functions`, `rs_setsrid`,
`rs_size`, `rs_spatial_predicates`, `rs_srid`, `rasters.rs`,
`benchmark_util.rs`, and the affine / display / GDAL tests all use the
metadata-struct surface that main exposes. The handful of PR-B-only
additions that remain are structural (the `RasterRefBandsExt` import on
the few generic helpers that need it, and the `is_2d()` guards / N-D
rejection tests on the GDAL backend).

Net effect on PR-B vs main: roughly 18 files, +3.2k / -1.5k, with most
of the inserts now sitting in the N-D-native reader/builder code and
nearly all of main's pre-N-D test bodies preserved verbatim.
…on cfg(test)

`utils.rs` is the canonical `Dataset -> SedonaDB raster` loader on main:
`append_as_indb_raster` and `dataset_to_indb_raster` are re-exported from
the crate root and are the public API consumers reach for. The earlier
N-D loader port deleted the file outright and replaced its callers with
code reaching into `BandRef::contiguous_data()` directly — a breaking
API change with no in-tree replacement. Restore the file verbatim from
main; the shim's `start_raster(&RasterMetadata{..}) /
start_band(BandMetadata{..})` already produces N-D-compatible 2D
rasters, so the existing body works unchanged. The only PR-B-only diff
is `use sedona_raster::traits::{RasterRef, RasterRefBandsExt};` in the
test module so `raster.bands()` resolves through our extension trait.
Brings 7 GDAL-loader tests back into the suite.

Also gate `mod source_uri;` on `#[cfg(test)]` to match main. With the
shim eagerly parsing the `#band=N` fragment via
`split_outdb_band_fragment` in `sedona-raster::traits`, the GDAL
backend no longer reaches for `parse_outdb_source`. Main already had
the module test-only for exactly this reason — neither
`#[allow(dead_code)]` nor a TODO is needed.
…me tie

Two structural cleanups that cascade through call sites and shave ~280
lines of divergence vs main:

1. Fold `RasterRefBandsExt::bands()` into `RasterRef` as a required
   trait method. Each `impl RasterRef for X` provides
   `fn bands(&self) -> Bands<'_> { Bands::new(self) }` (one line per
   impl, two impls total). Drops the extension trait, its blanket impl
   for `T: RasterRef`, its impl for `dyn RasterRef + 'r`, the six
   `use sedona_raster::traits::RasterRefBandsExt;` imports across the
   raster-functions / raster-gdal / testing crates, and the
   `+ RasterRefBandsExt` bound on the four generic GDAL-backend
   functions (`raster_ref_to_gdal_mem`, `raster_ref_to_gdal_empty`,
   `build_vrt_from_sources`, `raster_ref_to_gdal`,
   `VrtKey::from_raster`). Adds `Bands::new` as a pub constructor so
   `bands()` impls outside `traits.rs` can build the wrapper.

2. Rebuild `RasterRefImpl<'a>` to hold flat `&'a Array` references for
   every field instead of a back-pointer to `&'a RasterStructArray<'a>`.
   `RasterStructArray::get` goes back to `fn get(&self, idx) ->
   RasterRefImpl<'a>` (was `&'a self`), which removes the lifetime tie
   that was forcing every scalar-raster call site to hoist
   `RasterStructArray::new(...)` into a local. Reverts the
   `let arr0; arr0 = ...; if arr0.is_null(0) ...` plumbing in
   `executor.rs` and the two-line hoists in `sedona-testing/rasters.rs`.
   Restores `gdal_common::tests::single_raster` so the four
   `let rsa = RasterStructArray::new(&raster_array); let raster =
   rsa.get(0).unwrap();` test sites can go back to
   `let raster = single_raster(&raster_array);`.

Also revert `AffineMatrix::from_metadata` to main's `&dyn MetadataRef`
parameter — callers pass `&raster.metadata()` (one extra `&` versus
main, since our shim's `metadata()` returns `RasterMetadata` by value
rather than `&dyn MetadataRef`). The `impl<T: MetadataRef + ?Sized>
MetadataRef for &T` blanket impl that was carrying the by-value-generic
form goes with it.
`builder.rs` and `array.rs` each had their test modules entirely rewritten
during the N-D port, with main's tests deleted and replaced by N-D-flavoured
variants. The shim makes main's tests buildable as-is. Restore them
verbatim and keep the PR-B-only N-D tests alongside:

- builder.rs: `test_iterator_basic_functionality`, `test_multi_band_iterator`,
  `test_copy_metadata_from_iterator`, `test_band_data_types`,
  `test_outdb_metadata_fields`, `test_band_access_errors`. Adds
  `use crate::traits::{BandMetadata, MetadataRef};` at the top so `start_band`
  / `start_raster` can drop their fully-qualified `crate::traits::` paths
  and match main's call shape, and the test module pulls `BandMetadata` /
  `RasterMetadata` / `StorageType` in alongside the existing N-D imports.
- array.rs: `test_array_basic_functionality`, `test_multi_band_array`,
  `test_raster_is_null`. The fourth main test
  (`test_invalid_band_metadata_returns_err`) doesn't translate — it pokes a
  nested `metadata` Struct that doesn't exist in our flat N-D layout — and
  the equivalent corruption check is already covered by
  `band_and_band_data_type_return_none_for_unknown_discriminant`.

Two cleanups that fall out:

- `Bands::band(0)` and out-of-range errors now use main's exact wording
  ("band numbers must be 1-based" / "is out of range") so
  `test_band_access_errors` passes without a tweak.
- `BandRefImpl::data()` is now an explicit override that returns the raw
  `data` column bytes (matching main: `&[]` for OutDb, the row-major buffer
  for InDb identity views) instead of routing through `nd_buffer()` and
  panicking on OutDb. The shim's default `BandRef::data()` is kept for
  other implementors.

PR-B total: 10 files +3,286 / -897 (was +3,260 / -1,985 before this round).
Second-round review punch list — small, surgical:

- `builder.rs` imports: merge the two `sedona_schema::raster::{...}`
  imports into one, recombine `arrow_schema::{ArrowError, DataType}`
  (the lone `use arrow_schema::DataType;` line was unnecessary), drop
  the duplicate blank lines around the imports block.
- `start_raster_2d` doc-comment was the stitched-together result of two
  drafts ("Sets `spatial_dims=...`, and Build the 6-element GDAL
  transform..." with a self-reference at the end). Rewrite as a single
  coherent sentence and point at `Self::start_raster` instead of
  `Self::start_raster_2d`.
- `MetadataRef`: restore main's per-method doc comments
  (`/// Width of the raster in pixels` etc.) — they were dropped on the
  N-D rewrite for no N-D reason since this trait is the pre-N-D shim.
- `array.rs`: drop the three `// ----` banner blocks ("Band
  implementation", "Raster implementation", "RasterStructArray …") that
  weren't in main; the doc-comment on each type already names it. Also
  drop the `Critical #N` / `Important #N` prefixes from the five
  test-section dividers — they encode review prioritization in source.
- `sedona-schema/src/raster.rs`: `StorageType` repr back to `#[repr(u16)]`
  to match main; the only existing caller casts `as u32` explicitly so
  binary layout doesn't matter here.
- `traits.rs`: drop the unused `PartialEq` derive on `RasterMetadata`
  and `BandMetadata` — main has bare `#[derive(Debug, Clone)]` on both
  and no workspace consumer compares either via `==` / `assert_eq!` on
  the values themselves.

The other big finding (drop the inherent `impl RasterMetadata { ... }`
in favour of `MetadataRef` trait imports at every call site) was a wash
once accounted for — 25 inherent-method lines vs ~10 added trait imports
across ~8 files, with the structural cost that every downstream
consumer of `RasterMetadata.width()` style would need the trait import
forever. Kept the inherent block.

PR-B total: 10 files +3,270 / -888.
CI's codespell hook flagged two occurrences in `traits.rs` doc comments
(`MetadataRef` and `RasterRef::bands` definitions).
- `RasterRefImpl::band()`: change the non-null view row case from a
  silent `return None` to an `assert!`, surfacing the corrupt-schema
  invariant rather than hiding it behind an out-of-range-looking miss.
- `BandRef::data()` (trait default): document as a compatibility shim,
  delegate to `contiguous_data()` and panic on `Cow::Owned` (the
  view-materialized path can't be returned through `&[u8]` since the
  owned `Vec` would die at the end of the call). Implementors that
  need view-materialized bytes via `data()` must override and anchor
  the materialized buffer on `Self`; other consumers should reach for
  `contiguous_data()` directly.
- `BandRefImpl::data()` (concrete override): keeps the one-line raw
  bytes return; correct by construction since the upstream `assert!`
  rejects non-identity views before BandRefImpl construction. GDAL
  backend in `raster_ref_to_gdal_mem` keeps using this directly —
  no `contiguous_data()` plumbing until view support actually lands.
- `traits::nodata_bytes_to_f64`: drop `pub`. The only external surface
  is the lossless wrapper (`nodata_bytes_to_f64_lossless`); the lossy
  variant is now an internal fallback for non-i64/u64 dispatch.
  `sedona-raster-gdal::gdal_common::nodata_bytes_to_f64` is a separate
  `Option<f64>`-returning helper, unaffected.
- `traits.rs`: drop the `test_nodata_as_f64_int64_loses_precision_above_2_pow_53`
  test. With `nodata_bytes_to_f64_lossless` as the canonical path,
  locking in the lossy fallback's rounding behavior on out-of-mantissa
  Int64 documents an implementation detail rather than a contract.
…ion_err

Use the SedonaDB-internal-error macro as the assert message so the
non-identity view panic carries the standard "file a bug report"
pointer and backtrace, matching how the rest of the codebase signals
internal invariant violations.
Re-enables non-null `view` rows in the N-D raster reader. PR-B treats a
non-null view as a hard read error because its only writer (start_band)
emits the canonical identity (null view row); the view → byte-stride
composition path is needed by the slice/manipulation functions
(RS_Slice, RS_SliceRange, RS_DimToBand, RS_BandToDim) that land on
top of this PR.

Reintroduced:

- traits.rs: validate_view() + 22 unit tests.
- builder.rs: start_band_with_view() builder API and ~12
  view-construction tests (slice, broadcast, transpose, negative-step,
  IPC round-trip, etc.).
- array.rs: view → byte-stride composition in nd_buffer(), view-aware
  contiguous_data() Cow::Owned strided-copy slow path, and the array
  reader tests for explicit views (negative steps, OOB axis, length
  mismatch, malformed view rejection).

Identity-view writer/reader behavior is unchanged; PR-B's error path on
non-null view rows is replaced by the composition path validated here.
…terialized bands

With non-identity views now supported by the reader, BandRef::contiguous_data()
may return Cow::Owned(Vec<u8>) for sliced, broadcast, or permuted bands —
the reader materializes those into a fresh Vec rather than borrowing the
StructArray buffer in place. GDAL holds a raw pointer into that Vec for as
long as the MEM dataset lives, so the Vec must outlive the dataset.

raster_ref_to_gdal_mem now returns (Dataset, Vec<Vec<u8>>): the second
element captures any Cow::Owned band bytes moved out of contiguous_data()
without an extra copy. raster_ref_to_gdal_empty unpacks the tuple,
debug-asserts the owned Vec is empty (zero-band call), and returns the
dataset as before. raster_ref_to_gdal threads the owned-bytes Vec through
all four return arms into RasterDataset's restored _owned_band_bytes
field, which the dataset's Drop ordering ensures lives at least as long
as the MEM dataset whose pointers it backs.
…crate)

Both functions are internal implementation details — validate_view runs
implicitly inside start_band_with_view (writer) and RasterRef::band
(reader), and visible_shape_from_view is only needed at band-construction
time. External callers go through the builder/reader, which validate
on their behalf, and read shape back through BandRef::shape().
…ceCell

Two related changes:

1. Extract the strided-walk body shared between contiguous_data and the
   non-identity view path of data() into a single `materialize_strided`
   module-private helper. contiguous_data now delegates to nd_buffer
   for the OutDb check and identity-view fast path, then calls the
   helper for the strided copy.

2. Override BandRef::data() on BandRefImpl to handle non-identity views
   correctly. Previously the override returned the raw column bytes
   verbatim — fine for identity views (the column bytes ARE the visible
   bytes) but wrong for non-identity views (would return the source
   buffer, not the strided projection). The new impl borrows directly
   for identity views, materializes through a OnceCell<Vec<u8>> cache
   for non-identity views, and returns &[] for OutDb (matching main's
   pre-N-D behavior). Repeated .data() calls reuse the cached buffer.

Together these defuse the .data() footgun for view-materialized bands
without changing any caller. The fallible-Result surface stays on
contiguous_data and nd_buffer; data() panics on materialization failure
because validate_view ran at band construction and an arithmetic
overflow during the walk would indicate corruption beyond what
validate_view catches.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant